Abstract

The dynamics of tumour evolution are not well understood. In this paper 
we provide a statistical framework for evaluating the molecular variation 
observed in different parts of a colorectal tumour. A multi-sample version 
of the Ewens Sampling Formula forms the basis for our modelling of 
the data, and we provide a simulation procedure for use in obtaining 
reference distributions for the statistics of interest. We also describe 
the large-sample asymptotics of the joint distributions of the variation 
observed in different parts of the tumour. While actual data should be 
evaluated with reference to the simulation procedure, the asymptotics 
serve to provide theoretical guidelines, for instance with reference to the 
choice of possible statistics. 

AMS subject classification (MSC2010) 92D20; 92D15, 92C50, 60C05, 
62E17 



1 Introduction 

Cancers are thought to develop as clonal expansions from a single
transformed, ancestral cell. Large-scale sequencing studies have shown
that cancer genomes contain somatic mutations occurring in many genes;
cf. Greenman et al., Sjöblom et al. and Shah et al. [16]. Many of these
mutations are thought to be passenger mutations (those that are not
driving the behaviour of the tumour), and some are pathogenic driver
mutations that influence the growth of the tumour. The dynamics of
tumour evolution are not well understood, in part because serial
observation of tumour growth in humans is not possible.

In an attempt to better understand tumour growth and structure, a
number of evolutionary approaches have been described. Merlo et al.
[15] give an excellent overview of the field. Tsao et al. [21] used
non-coding microsatellite loci as molecular tumour clocks in a number
of human mutator phenotype colorectal tumours. Stochastic models of
tumour growth and statistical inference were used to estimate ancestral
features of the tumours, such as their age (defined as the time to loss
of mismatch repair). Campbell et al. used deep sequencing of a DNA
region to characterise the phylogenetic relationships among clones
within patients with B-cell chronic lymphocytic leukaemia. Siegmund et
al. used passenger mutations at particular CpG sites to infer aspects
of the evolution of colorectal tumours in a number of patients, by
examining the methylation patterns in different parts of each tumour.

The problem of comparing the molecular variation present in different
parts of a tumour is akin to the following problem from population
genetics. Suppose that $R$ observers take samples of sizes
$n_1, \ldots, n_R$ from a population, and record the molecular
variation seen in each member of their sample. If the population were
indeed homogeneous, it makes sense to ask about the relative amount of
genetic variation seen in each sample. For example, how many genetic
types are seen by all the observers, how many are seen by a single
observer, and so on? Ewens et al. discuss this problem in the case of
$R = 2$ observers; the methodological contribution of the present paper
addresses the case of multiple observers. The theory is used to study
the spatial organization of the colorectal tumours studied in Siegmund
et al. [18].

This paper is organized as follows. In Section 2 we describe the tumour
data that form the motivation for our work. The Ewens Sampling Formula,
which forms the basis for our modelling of the data, is described in
Section 3, together with a simulation procedure for use in obtaining
reference distributions for the statistics of interest. The procedure
for testing whether the observers are homogeneous among themselves is
illustrated in Section 4. The remainder of the paper is concerned with
the large-sample asymptotics of the joint distributions of the allele
counts from the different observers. While actual data should be
evaluated with reference to the simulation procedure, the asymptotics
serve to provide theoretical guidelines, for instance with reference to
the choice of possible statistics.

Figure 2.1 Left panel: sampling illustrated from three glands from
one side of a colorectal tumour. Each gland contains 2,000-10,000
cells. Right panel: Methylation data from the BGN locus from 7
glands from the left side of Cancer 1 (CNC1, from [18]). 8 cells
are sampled from each gland. Each row of 9 circles represents the
methylation pattern in a cell. Solid circles denote methylated sites,
open circles unmethylated. See Table 2.1 for further details.



2 Colorectal cancer data 

In this section we describe the colorectal cancer data that motivate
the ensuing work. Yatabe et al. [24] describe an experimental procedure
for sampling CpG DNA methylation patterns from cells. These methylation
patterns change during cell division, due to random mutational events
that result in switching an unmethylated site to a methylated one, or
vice versa. The methylation patterns obtained from a particular locus
may be represented as strings of binary outcomes, a 1 denoting a
methylated site and a 0 an unmethylated one.

Siegmund et al. [18] studied 12 human colorectal tumours, each taken
from male patients of known ages. Samples of cells were taken from 7
different glands from each of two sides of each tumour, and the
methylation pattern at two neutral (passenger) CpG loci (BGN, 9 sites;
and LOG, 14 sites; both are on the X chromosome) was measured in each
of 8 cells from each gland. Figure 2.1 illustrates the sampling, and
depicts the data from the left side of Cancer 1.

Data obtained from methylation patterns may be compared in several
ways. We focus on the simplest method that considers whether or not
cells have the same allele (that is, an identical pattern of 0s and 1s).
Here we do not exploit information about the detailed structure of the
methylation patterns. In Table 2.1 we present the data from Cancer 1
shown in Figure 2.1 in a different way. The body of the table shows the
numbers of cells of each allele (or type) in each of the 7 samples. The
third row of the table shows the numbers $K_i$ of different alleles
seen in each sample. In Table 2.2 we give a similar breakdown for data
from the left side of Cancer 2.

Table 2.1 Data for Cancer 1. 13 alleles were observed in the 7
samples. Columns labelled 1-7 give the distribution of the alleles
observed in each sample, and column 8 shows the combined data.
Data from cancer CNC1 in [18].

The last column in Tables 2.1 and 2.2 gives the combined distribution
of allelic variation at this locus in the two tumours. Qualitatively,
the two tumours seem to have rather different behaviour: Cancer 1 has
far fewer alleles than Cancer 2, and their allocation among the
different samples is more homogeneous in Cancer 1 than in Cancer 2. In
the next sections we develop some theory that allows us to analyse this
variation more carefully.

Table 2.2 Data for Cancer 2. 27 alleles were observed in the 7
samples. Columns labelled 1-7 give the distribution of the alleles
observed in each sample, and column 8 shows the combined data.
Data from cancer COC1 in [18].



3 The Ewens sampling formula

Our focus is on identifying whether the data are consistent with a
uniformly mixing collection of tumour cells that are in approximate
stasis, or are more typical of patterns of growth such as described in
Siegmund et al. [17], [18], [19]. Whatever the model, the basic
ingredients that must be specified include how the cells are related,
the details of which depend on the demographic model used to describe
the tumour evolution, and the mutation process that describes the
methylation patterns. A review is provided in Siegmund et al. We use a
simple null model in which the population of cells is assumed to have
evolved for some time with an approximately constant, large size of $N$
cells, the constancy of cell numbers mimicking stasis in tumour growth.
The mutation model assumes that in each cell division there is
probability $u$ of a mutation resulting in a type that has not been
seen before (admittedly a crude approximation to the nature of
methylation mutations arising in our sample). The mutations are assumed
to be neutral, a reasonable assumption given that the BGN gene is
expressed in connective tissue but not in the epithelium. Thus our
model is a classical one from population genetics, the so-called
infinitely-many-neutral-alleles model.

Under this model the distribution of the types observed in the combined
data (i.e., the allele counts derived from the right-most columns of
data in Tables 2.1 and 2.2) depends on the parameter $\theta = 2Nu$.
This distribution is known as the Ewens Sampling Formula, denoted by
ESF($\theta$), and may be described as follows. For a sample of $n$
cells, we write $(C_1, C_2, \ldots, C_n)$ for the vector of counts
given by

\[ C_j = \text{number of types represented $j$ times in the sample}, \]

where $C_1 + 2C_2 + \cdots + nC_n = n$. For the Cancer 1 sample we have
$n = 56$ and

\[ C_1 = 6,\ C_2 = 3,\ C_3 = 1,\ C_8 = 1,\ C_{16} = 1,\ C_{17} = 1, \]

whereas for Cancer 2 we also have $n = 56$, but

\[ C_1 = 17,\ C_2 = 5,\ C_3 = 2,\ C_5 = 1,\ C_6 = 1,\ C_{12} = 1. \]

The distribution ESF($\theta$) is given by

\[ P[C_1 = c_1, \ldots, C_n = c_n]
   = \frac{n!}{\theta_{(n)}} \prod_{j=1}^{n}
     \left(\frac{\theta}{j}\right)^{c_j} \frac{1}{c_j!}, \tag{3.1} \]

for $c_1 + 2c_2 + \cdots + nc_n = n$ and
$\theta_{(n)} := \theta(\theta + 1) \cdots (\theta + n - 1)$. An
explicit connection between mutations resulting in the ESF and the
ancestral history of the individuals (cells) in the sample is provided
by Kingman's coalescent [13], [14], and the connection with the
infinite population limit is given in Kingman [11, 14].
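
As an illustration of formula (3.1), the following Python sketch evaluates the ESF probability on the log scale and checks numerically that the probabilities of all partitions of a small $n$ sum to one. The function names (`esf_log_prob`, `partitions`) and the dictionary representation of the counts are our own illustrative choices, not part of the original analysis.

```python
from math import exp, lgamma, log


def esf_log_prob(counts, theta):
    """Log of (3.1); counts maps j -> c_j (types represented j times)."""
    n = sum(j * c for j, c in counts.items())
    # log n! - log theta_(n), with theta_(n) = theta (theta+1) ... (theta+n-1)
    logp = lgamma(n + 1) - (lgamma(theta + n) - lgamma(theta))
    for j, c in counts.items():
        logp += c * log(theta / j) - lgamma(c + 1)
    return logp


def partitions(n, largest=None):
    """Yield every partition of n as a dict {part size: multiplicity}."""
    if largest is None:
        largest = n
    if n == 0:
        yield {}
        return
    for j in range(min(n, largest), 0, -1):
        for rest in partitions(n - j, j):
            out = dict(rest)
            out[j] = out.get(j, 0) + 1
            yield out


# Allele counts for Cancer 2 as listed in the text (n = 56, 27 types).
cancer2 = {1: 17, 2: 5, 3: 2, 5: 1, 6: 1, 12: 1}
```

A call such as `esf_log_prob(cancer2, theta)` then gives the log-likelihood of the combined Cancer 2 data for any trial value of `theta`.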

We recall from Ewens that $K = K_n := C_1 + \cdots + C_n$, the number
of types in the sample, is a sufficient statistic for $\theta$, and
that the maximum-likelihood estimator $\hat\theta$ of $\theta$ is the
solution of the equation

\[ K_n = \sum_{j=1}^{n} \frac{\hat\theta}{\hat\theta + j - 1}. \tag{3.2} \]

The conditional distribution of the counts $C_1, \ldots, C_n$ given
$K_n$ does not depend on $\theta$, and thus may be used to assess the
goodness-of-fit of the model.



3.1 The multi-observer ESF

So far, we have described the distribution of variation in the entire
sample, rather than in each of the subsamples from the different glands
separately. The joint law of the counts of different alleles seen in
the $R$ glands (that is, by the $R$ observers) is precisely that
obtained by taking a hypergeometric sample of sizes
$n_1, n_2, \ldots, n_R$ from the $n$ cells in the combined sample. It
is a consequence of the consistency property of the ESF that the sample
seen by each observer $i$ has its own ESF, with parameters $n_i$ and
$\theta$, $i = 1, 2, \ldots, R$. Tables 2.1 and 2.2 give the observed
values for the two tumour examples.

We are interested in assessing the goodness-of-fit of the tumour data
subsamples to our simple model of a homogeneous tumour in stasis.
Because $K_n$ is sufficient for $\theta$ in the combined sample, this
can be performed by using the joint distribution of the counts seen by
each observer, conditional on the value of $K_n$. To simulate from this
distribution we use the Chinese Restaurant Process, as described in the
next section.



3.2 The Chinese Restaurant Process

We use simulation to find the distribution of certain test statistics
relating to the multiple observer data. To do this we exploit a simple
way to simulate a sample of individuals (cells in our example) whose
allele counts follow the ESF($\theta$). The method, known as the
Chinese Restaurant Process (CRP), after Diaconis and Pitman, simulates
individuals in a sample sequentially. The first individual is given
type 1. The second individual is either a new type (labelled 2) with
probability $\theta/(\theta + 1)$, or a copy of the type of individual
1, with probability $1/(\theta + 1)$. Suppose that $k - 1$ individuals
have been assigned types. Individual $k$ is assigned a new type (the
lowest unused positive integer) with probability
$\theta/(\theta + k - 1)$, or is assigned the type of one of
individuals $1, 2, \ldots, k - 1$ selected uniformly at random.
Continuing until $k = n$ produces a sample of size $n$, and the joint
distribution of the number of types represented once, twice, ... is
indeed ESF($\theta$).
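
The sequential construction just described translates directly into code. The following is a minimal Python sketch; the function names and the use of the standard `random` module are illustrative assumptions rather than the software used for the analyses reported here.

```python
import random
from collections import Counter


def chinese_restaurant_process(n, theta, rng=random):
    """Return a list of n type labels (1, 2, ...) generated sequentially."""
    types = [1]                      # the first individual is given type 1
    next_label = 2
    for k in range(2, n + 1):
        # Individual k is a new type with probability theta / (theta + k - 1).
        if rng.random() < theta / (theta + k - 1):
            types.append(next_label)
            next_label += 1
        else:
            # Otherwise copy the type of a uniformly chosen earlier individual.
            types.append(rng.choice(types))
    return types


def allele_counts(types):
    """Return {j: C_j}, the number of types represented j times."""
    group_sizes = Counter(types)
    return dict(Counter(group_sizes.values()))
```

For example, `allele_counts(chinese_restaurant_process(56, 5.0))` returns one simulated count vector for a sample of 56 cells.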

Once the sample of $n$ individuals is generated, it is straightforward
to subsample without replacement to obtain $R$ samples, of sizes
$n_1, \ldots, n_R$, in each of which the distribution of the allele
counts follows the ESF($\theta$) of the appropriate size. This may be
done sequentially, choosing $n_1$ without replacement to be the first
sample, then $n_2$ from the remaining $n - n_1$ to form the second
sample, and so on.
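
A sketch of this subsampling step in Python is given below. Shuffling the combined sample and cutting it into consecutive blocks is equivalent to drawing the $R$ samples sequentially without replacement; the function name `split_among_observers` is a hypothetical one introduced for illustration.

```python
import random


def split_among_observers(types, sizes, rng=random):
    """Split the combined sample into len(sizes) samples without replacement."""
    if sum(sizes) > len(types):
        raise ValueError("requested more individuals than are available")
    shuffled = list(types)
    rng.shuffle(shuffled)            # uniformly random order of the n cells
    samples, start = [], 0
    for n_r in sizes:
        # The next n_r cells in random order form observer r's sample.
        samples.append(shuffled[start:start + n_r])
        start += n_r
    return samples
```

For the tumour data one would call it with `sizes=[8] * 7`, one block of 8 cells per gland.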

When samples of size $n$ are required to have a given number of
alleles, say $K_n = k$, this is most easily arranged by the rejection
method: the CRP is run to produce an $n$-sample, and that run is
rejected unless the correct value of $k$ is observed. Since,
conditional on $K_n = k$, the distribution of the allele frequencies is
independent of $\theta$, we have freedom to choose $\theta$, which may
be taken as the MLE $\hat\theta$ determined in (3.2) to make the
rejection probability as small as possible.
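
The rejection scheme can be sketched as follows. The code first solves the MLE equation (3.2) by bisection and then reruns the CRP until the required number of types appears. The function names, the bisection tolerance and the cap `max_tries` are illustrative assumptions.

```python
import random


def expected_types(theta, n):
    """Right-hand side of (3.2): sum over j = 1..n of theta / (theta + j - 1)."""
    return sum(theta / (theta + j) for j in range(n))


def theta_mle(k, n, tol=1e-10):
    """Solve expected_types(theta, n) = k by bisection (requires 1 < k < n)."""
    if not 1 < k < n:
        raise ValueError("the MLE is finite and positive only for 1 < k < n")
    lo, hi = 1e-12, 1.0
    while expected_types(hi, n) < k:     # bracket the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_types(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def crp(n, theta, rng):
    """Chinese Restaurant Process: n type labels 1, 2, ... in order of appearance."""
    types, next_label = [1], 2
    for k in range(2, n + 1):
        if rng.random() < theta / (theta + k - 1):
            types.append(next_label)
            next_label += 1
        else:
            types.append(rng.choice(types))
    return types


def conditional_crp(n, k, rng=random, max_tries=100000):
    """Draw an n-sample with exactly k types by rejection, using the MLE."""
    theta = theta_mle(k, n)
    for _ in range(max_tries):
        types = crp(n, theta, rng)
        if max(types) == k:              # labels are 1..K, so max label = K
            return types
    raise RuntimeError("no accepted sample within max_tries runs")
```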



4 Analysis of the cancer data

We have noted that the combined data in the $R$-observer ESF have the
ESF($\theta$) distribution with sample size $n = n_1 + \cdots + n_R$,
while the $i$th observer's sample has the ESF($\theta$) distribution
with sample size $n_i$. Of course, these distributions are not
independent. To test whether the combined data are consistent with the
ESF, we may use a statistic suggested by Watterson, based on the
distribution of the sample homozygosity

\[ F = \sum_{j=1}^{n} C_j \left(\frac{j}{n}\right)^2 \]

found after conditioning on the number of types seen in the combined
sample. Each marginal sample may be tested in a similar way using the
appropriate value of $n$.

Since our cancer data arise as the result of a spatial sampling scheme,
it is natural to consider statistics that are aimed at testing whether
the samples can be assumed homogeneous, that is, are described by the
multi-observer ESF. Knowing the answer to this question would aid in
understanding the dynamics of tumour evolution, which in turn has
implications for understanding metastasis and response to therapy.

To assess this, we use as a simple illustration the sample variance of
the numbers of types seen in each sample. The statistic may be written
as

\[ Q = \frac{1}{R - 1} \sum_{r=1}^{R} (K_r - \bar K)^2
     = \frac{1}{R(R - 1)} \sum_{1 \le r < s \le R} (K_r - K_s)^2, \tag{4.1} \]

where $\bar K := R^{-1} \sum_{r=1}^{R} K_r$, the latter expression
emphasizing its role as a measure of the average discrepancy between
samples. In the next paragraphs, we discuss the structure of Cancers 1
and 2 using these statistics.
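
The two statistics are simple to compute from labelled samples. The Python sketch below takes the sample homozygosity to be $F = \sum_j C_j (j/n)^2$ and implements both forms of $Q$ in (4.1); the helper `types_from_counts` and all function names are hypothetical conveniences. Applied to the Cancer 2 allele counts listed in Section 3, this definition of $F$ gives approximately 0.083.

```python
from collections import Counter
from itertools import combinations


def types_from_counts(counts):
    """Expand {j: C_j} into a list of type labels for individual cells."""
    types, label = [], 0
    for j, c in counts.items():
        for _ in range(c):
            label += 1
            types.extend([label] * j)
    return types


def homozygosity(types):
    """F = sum over types of (group size / n)^2."""
    n = len(types)
    return sum(size ** 2 for size in Counter(types).values()) / n ** 2


def q_statistic(samples):
    """Sample variance of the numbers of types K_r seen in each sample."""
    k = [len(set(s)) for s in samples]
    r = len(k)
    mean = sum(k) / r
    return sum((x - mean) ** 2 for x in k) / (r - 1)


def q_statistic_pairwise(samples):
    """Equivalent pairwise form: average squared discrepancy between samples."""
    k = [len(set(s)) for s in samples]
    r = len(k)
    return sum((a - b) ** 2 for a, b in combinations(k, 2)) / (r * (r - 1))
```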

Cancer 1 We begin with a comparison of the data from the two sides
of Cancer 1. In this case $n_1 = 56$, $n_2 = 56$ and the combined
sample of $n = 112$ has $K_{112} = 16$ and $F = 0.237$. The 5th and
95th percentiles of the null distribution of $F$ found by the
conditional CRP simulation described in the last section are 0.108 and
0.277 respectively, suggesting no anomaly with the underlying ESF
model. For the left side of the cancer (Table 2.1), $K_{56} = 13$ and
$F = 0.209$, while for the right side (data not shown), $K_{56} = 10$
and $F = 0.293$. In both cases these observed values of $F$ are
consistent with the ESF. We then use the statistic $Q$ to investigate
whether the data from the 7 glands from the left side of the tumour are
homogeneous. We observed $Q = 2.24$, and the null distribution of $Q$
can also be found from the conditional CRP simulation. We obtained 5th
and 95th percentiles of 0.29 and 2.48 respectively, supporting the
conclusion of a homogeneous tumour.
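
Null percentiles of this kind can be approximated by Monte Carlo. The following Python sketch simulates the conditional CRP by rejection and reads off empirical percentiles of a chosen statistic. The number of replicates, the trial value of $\theta$ passed in, and the function names are illustrative assumptions, and a short run like this is not expected to reproduce the percentiles quoted in the text.

```python
import random
from collections import Counter


def crp(n, theta, rng):
    """Chinese Restaurant Process: n type labels 1, 2, ... in order of appearance."""
    types, next_label = [1], 2
    for k in range(2, n + 1):
        if rng.random() < theta / (theta + k - 1):
            types.append(next_label)
            next_label += 1
        else:
            types.append(rng.choice(types))
    return types


def homozygosity(types):
    """F = sum over types of (group size / n)^2."""
    n = len(types)
    return sum(size ** 2 for size in Counter(types).values()) / n ** 2


def null_percentiles(n, k, theta, stat, reps, rng, probs=(0.05, 0.95)):
    """Empirical percentiles of stat over CRP samples accepted with K_n = k."""
    values = []
    while len(values) < reps:
        types = crp(n, theta, rng)
        if max(types) == k:          # keep only runs with exactly k types
            values.append(stat(types))
    values.sort()
    return [values[min(reps - 1, int(p * reps))] for p in probs]
```

Because the conditional law given $K_n = k$ does not depend on $\theta$, any positive value may be supplied; a value near the MLE keeps the acceptance rate high.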

Cancer 2 The comparison of the two sides of Cancer 2 is more
interesting. Once more $n_1 = 56$, $n_2 = 56$ but the combined sample
of $n = 112$ now has $K_{112} = 48$ and $F = 0.081$. The 99th
percentile of the null distribution of $F$ is 0.060, suggesting that
the ESF model is not adequate to describe the combined data. At first
glance the anomaly can be attributed to the data from the right side of
the tumour (not shown here), for which $K_{56} = 29$ and $F = 0.105$,
far exceeding the 99th percentile of 0.089. For the left side (Table
2.2), $F = 0.083$, just below the 95th percentile of 0.084. Thus the
left side seems in aggregate to be adequately described by the ESF
model. Further examination of the data from the 7 glands reveals a
different story. From the third row of Table 2.2 we calculate
$Q = 2.95$, far exceeding the estimated 99th percentile of 2.33. Thus a
more detailed view of the way the mutations are shared among the glands
shows that these data are indeed inconsistent with the homogeneity
expected in the multi-observer ESF.

Of course, many other statistics could have been considered. A natural
starting point for constructing them would be the numbers of alleles
that are seen only by a specific subset $A$ of the observers, where $A$
ranges over the $2^R - 2$ non-empty proper subsets of the $R$
observers. Such statistics form the basis of the results in Section 5.

Rejection of the null hypothesis of the uniformly mixing homogeneous
tumour model can occur for many reasons, for example because of
non-uniform mutation rates, different demography of cell growth,
non-neutrality of the mutations (which might apply to the BGN locus if
in fact it were expressed in tissue in the tumour), and unforeseen
effects of the simple mutation model itself. Which of these hypotheses
is most likely requires a far more detailed analysis of competing
models, as for example outlined in [17, 18, 19].



5 Poisson approximation 

In this section, we derive Poisson approximations to the joint
distribution of the numbers of alleles that are seen only by specific
subsets $A$ of the observers. As mentioned above, functionals of these
counts can be used as statistics to test for the homogeneity of
(subgroups of) observers. Our approximations come together with bounds
on the total variation distance between the actual and approximate
distributions. We begin with the case of $R = 2$ observers, and with
the statistic $K_1 - K_2$.



5.1 2 observers 

We write $C := (C_1, C_2, \ldots)$, where $C_j = 0$ for $j > n$, and
recall Watterson's result [22], that $(C_1, \ldots, C_n)$ are jointly
distributed according to
$\mathcal{L}(Z_1, Z_2, \ldots, Z_n \mid T_{0n}(Z) = n)$, where
$(Z_j, j \ge 1)$ are independent with $Z_j \sim \mathrm{Po}(\theta/j)$,
and

\[ T_{rs}(c) := \sum_{j=r+1}^{s} j c_j, \qquad c \in \mathbb{Z}_+^{\infty}. \tag{5.1} \]

The sampled individuals can be labelled 1 or 2, according to which
observer sampled them; under the above model, the $n_1$ 1-labels and
$n_2$ 2-labels are distributed at random among the individuals,
irrespective of their allelic type. Let $K_r$ denote the number of
distinct alleles observed by the $r$-th observer, $r = 1, 2$. Ewens et
al. observed that, in the case $n_1 = n_2$ and for large $n$,
$(K_1 - K_2)/\log n$ is equivalent to the difference in the estimates
of the mutation rate made by the two observers. The same is
asymptotically true also as $n$ becomes large, if $n_1/n \to p_1$ for
any fixed $p_1$. This motivates us to look for a distributional
approximation to the distribution of the difference $K_1 - K_2$.

Theorem 5.1 For any $n_1$, $n_2$ and $b$,

\[ d_{TV}\bigl(\mathcal{L}(K_1 - K_2), \mathcal{L}(P_1 - P_2)\bigr)
   \le \frac{kb}{n - 1} + \frac{k' p^{b+1}}{(b + 1)(1 - p)}, \]

for suitable constants $k$ and $k'$, where $P_1$ and $P_2$ are
independent Poisson random variables having means
$\theta \log\{1/(1 - p_1)\}$ and $\theta \log\{1/(1 - p_2)\}$
respectively, with $p_r := n_r/n$, and where
$p = \max\{1 - p_1, 1 - p_2\}$. The choice
$b = b_n = \lfloor \log n/\log(1/p) \rfloor$ gives a bound of order
$O(\log n/\{n \min\{p_1, p_2\}\})$.
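
To make the approximating law concrete, the following Python sketch computes the two Poisson means of Theorem 5.1 and the probability mass function of the difference of two independent Poisson variables by a truncated convolution. The function names and the truncation length `terms` are illustrative assumptions; no error bound from the theorem is evaluated here.

```python
from math import exp, lgamma, log


def poisson_pmf(x, mean):
    """P[Po(mean) = x] for integer x >= 0 and mean > 0."""
    return exp(-mean + x * log(mean) - lgamma(x + 1))


def two_observer_means(theta, n1, n2):
    """Means theta * log(1 / (1 - p_r)) of P_1 and P_2, with p_r = n_r / n."""
    n = n1 + n2
    p1, p2 = n1 / n, n2 / n
    return theta * log(1 / (1 - p1)), theta * log(1 / (1 - p2))


def difference_pmf(d, mean1, mean2, terms=200):
    """P[P_1 - P_2 = d] for independent Poisson P_1, P_2 (truncated sum)."""
    start = max(0, -d)               # need P_2 = y >= 0 and P_1 = y + d >= 0
    return sum(poisson_pmf(y + d, mean1) * poisson_pmf(y, mean2)
               for y in range(start, start + terms))
```

With equal sample sizes the two means coincide at $\theta \log 2$ and the approximating distribution of $K_1 - K_2$ is symmetric about zero.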

Proof Group the individuals in the combined sample according to their
allelic type, and let $M_{js}$ denote the number of individuals that
were observed by observer 1 in the $s$-th of the $C_j$ groups of size
$j$, the remaining $j - M_{js}$ being observed by observer 2. Define

\[ S_j^1 := \sum_{s=1}^{C_j} I[M_{js} = j] \quad\text{and}\quad
   S_j^2 := \sum_{s=1}^{C_j} I[M_{js} = 0] \]

to be the numbers of $j$-groups observed only by observers 1 and 2,
respectively. Then it follows that

\[ K_1 - K_2 = S^1 - S^2, \]

where $S^r := \sum_{j=1}^{n} S_j^r$. The first step in the proof is to
show that the effect of the large groups is relatively small.

Note that the probability that an allele which is present $j$ times in
the combined sample was not observed by observer 1 is

\[ \prod_{i=0}^{j-1} \frac{n_2 - i}{n - i} \le (1 - p_1)^j; \]

similarly, the probability that it was not observed by observer 2 is at
most $(1 - p_2)^j$. Hence, conditional on $C$, the probability that any
of the alleles present more than $b$ times in the combined sample is
seen by only one of the observers is at most

\[ \sum_{j=b+1}^{n} C_j \{(1 - p_1)^j + (1 - p_2)^j\}
   \le 2 \sum_{j=b+1}^{n} C_j p^j, \]

whatever the value of $b$. Hence, writing
$U_b := \sum_{j=1}^{b} (S_j^1 - S_j^2)$, we have that

\[ P[K_1 - K_2 \ne U_b] \le 2 \sum_{j=b+1}^{n} p^j\, E C_j
   \le \frac{2 k_1 p^{b+1}}{(b + 1)(1 - p)}, \tag{5.2} \]

where, by Watterson's formula 22| for the means of the component sizes, 
we can take ki := {29 + e''^) if n > 4(6 + 1) (and ki := 9 if 9 > 1). 

To approximate the distribution of $U_b$, note that, conditional on
$C$, the number of 1-labels among the individuals in allele groups of
at most $b$ individuals has a hypergeometric distribution

\[ \mathrm{HG}(T_{0b}(C); n_1; n), \]

where $\mathrm{HG}(s; m; n)$ denotes the number of black balls obtained
in $s$ draws from an urn containing $m$ black balls out of a total of
$n$. By Theorem 3.1 of Holmes, we have

\[ d_{TV}\bigl(\mathrm{HG}(T_{0b}(C); n_1; n),
               \mathrm{Bi}(T_{0b}(C), p_1)\bigr)
   \le \frac{T_{0b}(C) - 1}{n - 1}. \tag{5.3} \]

Hence, conditional on $C$, the joint distribution of labels among
individuals differs in total variation from that obtained by
independent Bernoulli random assignments, with label 1 having
probability $p_1$ and label 2 probability $p_2$, by at most
$(T_{0b}(C) - 1)/(n - 1)$.

Now, by Lemma 5.3 of Arratia et al. [1], we also have

\[ d_{TV}\bigl(\mathcal{L}(C_1, \ldots, C_b),
               \mathcal{L}(Z_1, \ldots, Z_b)\bigr)
   \le \frac{c_\theta b}{n}, \]

with $c_\theta \le 4\theta(\theta + 1)/3$ if $n \ge 4b$. Hence, and
from (5.3), it follows that

\[ d_{TV}\bigl(\mathcal{L}(C_1, \ldots, C_b;
        \{M_{js}, 1 \le s \le C_j, 1 \le j \le b\}),\
      \mathcal{L}(Z_1, \ldots, Z_b;
        \{N_{js}, 1 \le s \le Z_j, 1 \le j \le b\})\bigr)
   \le \frac{c_\theta b}{n} + \frac{E(T_{0b}(C)) - 1}{n - 1}, \tag{5.4} \]

where $(N_{js};\ s \ge 1,\ 1 \le j \le b)$ are independent of each
other and of $Z_1, \ldots, Z_b$, with
$N_{js} \sim \mathrm{Bi}(j, p_1)$. But now the values of the $N_{js}$,
$1 \le s \le Z_j$, $1 \le j \le b$, can be interpreted as the numbers
of 1-labels assigned to each of $Z_j$ groups of size $j$ for each
$1 \le j \le b$, again under independent Bernoulli random assignments,
with label 1 having probability $p_1$ and label 2 probability $p_2$.
Hence, since the $Z_j$ are independent Poisson random variables, the
counts

\[ T_j^1 := \sum_{s=1}^{Z_j} I[N_{js} = j] \quad\text{and}\quad
   T_j^2 := \sum_{s=1}^{Z_j} I[N_{js} = 0] \]

are pairs of independent Poisson distributed random variables, with
means $\theta j^{-1} p_1^j$ and $\theta j^{-1} p_2^j$, and are also
independent of one another. Hence it follows that

\[ V_b := \sum_{j=1}^{b} (T_j^1 - T_j^2) \sim P_{1b} - P_{2b}, \tag{5.5} \]

where $P_{1b}$ and $P_{2b}$ are independent Poisson random variables,
with means $\theta \sum_{j=1}^{b} j^{-1} p_1^j$ and
$\theta \sum_{j=1}^{b} j^{-1} p_2^j$, respectively. Comparing the
definitions of $U_b$ and $V_b$, and combining (5.4) and (5.5), it thus
follows that

\[ d_{TV}\bigl(\mathcal{L}(U_b), \mathcal{L}(P_{1b} - P_{2b})\bigr)
   \le \frac{(c_\theta + k_2)\, b}{n - 1}, \tag{5.6} \]

with $k_2 = 4\theta/3$ for $n \ge 4b$, once again by Watterson's
formula [22].

With (5.2) and (5.6), the argument is all but complete; it simply
suffices to observe that, much as in proving (5.2),

\[ d_{TV}\bigl(\mathcal{L}(P_1), \mathcal{L}(P_{1b})\bigr)
   + d_{TV}\bigl(\mathcal{L}(P_2), \mathcal{L}(P_{2b})\bigr)
   \le \frac{2\theta p^{b+1}}{(b + 1)(1 - p)}; \]

we take $k := 4 \vee (c_\theta + k_2)$ and $k' := 2(\theta + k_1)$. □



5.2 R observers 

The proof of Theorem 5.1 actually shows that the joint distribution of
$S^1$ and $S^2$, the numbers of types seen respectively by observers 1
and 2 alone, is close to that of independent Poisson random variables
$P_1$ and $P_2$. For $R \ge 3$ observers, we use a similar approach to
derive an approximation to the joint distribution of the numbers of
alleles seen by each proper subset $A$ of the $R$ observers.

Suppose that the $r$-th observer samples $n_r$ individuals,
$1 \le r \le R$, and set $n := \sum_{r=1}^{R} n_r$, $p_r := n_r/n$.
Define the component frequencies in the combined sample as before, and
set $M_{js} = m := (m_1, \ldots, m_R)$ if the $r$-th observer sees
$m_r$ of the $j$ individuals in the $s$-th of the $C_j$ groups of size
$j$. For any $\emptyset \ne A \subsetneq [R]$, where
$[R] := \{1, 2, \ldots, R\}$, define

\[ \mathcal{M}_{jA} := \Bigl\{ m \in \mathbb{Z}_+^R :
     \sum_{r=1}^{R} m_r = j,\ \{r : m_r \ge 1\} = A,\
     \{r : m_r = 0\} = [R] \setminus A \Bigr\} \]

and set

\[ S_j^A := \sum_{s=1}^{C_j} I[M_{js} \in \mathcal{M}_{jA}]. \]

Our interest lies now in approximating the joint distribution of the
counts $(S^A,\ \emptyset \ne A \subsetneq [R])$, where
$S^A := \sum_{j=1}^{n} S_j^A$. To do so, we need a set of independent
Poisson random variables $(P^A,\ \emptyset \ne A \subsetneq [R])$, with
$P^A \sim \mathrm{Po}(\lambda^A(\theta))$, where

\[ \lambda_j^A(\theta) := \frac{\theta}{j}\,
     \mathrm{MN}(j; p_1, \ldots, p_R)\{\mathcal{M}_{jA}\}
   \quad\text{and}\quad
   \lambda^A(\theta) := \sum_{j \ge 1} \lambda_j^A(\theta); \tag{5.7} \]

here, $\mathrm{MN}(j; p_1, \ldots, p_R)$ denotes the multinomial
distribution with $j$ trials and cell probabilities
$p_1, \ldots, p_R$.

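Theorem 5.2 In the above setting, we have

\[ d_{TV}\Bigl(\mathcal{L}\bigl((S^A,\ \emptyset \ne A \subsetneq [R])\bigr),\
      \bigtimes_{\emptyset \ne A \subsetneq [R]}
      \mathrm{Po}(\lambda^A(\theta))\Bigr)
   \le \frac{k_R b}{n} + \frac{k'_R p^{b+1}}{(b + 1)(1 - p)}, \]

where $p := \max_{1 \le r \le R}(1 - p_r)$. Again
$b = b_n = \lfloor \log n/\log(1/p) \rfloor$ is a possible choice.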



Proof The proof runs much as before. First, the bound 

"\ n R 



shows that 



^ E c,Y.{i-pry < rJ2 p'^^ 

) j=b+l r=l j=b+l 



0#AC[_R] 



< 



Rkip' 



b+l 



(b+m-p) 



(5.8) 
where $S^A_{(b)} := \sum_{j=1}^{b} S_j^A$. Then, by Theorem 4 of Diaconis and Freedman,

d,vfHG(Tob(C);ni,...,nfl;n),MN(ro6(C);pi,...,pfl)) < 

V / n 

from which it follows that 

d^^(L{Cu. . . , Cb; {Mj„ l<s<C,,l<:]< b]), 

C{Zi,. ..,Zi,] {N,,, 1 < s < Zj, 1 < J < b}) 

< £^ + mTo^jc))^ 

n n 

where {Njs; s>l,l<j<b) are independent of each other and of Zi, 
. . . , Zb, with Njs ^ MN . . . ,p_r). Then the random variables 

are independent and Poisson distributed, with means A^j(0) :~ 
E-=iA/(^?),and 

d,v{/:(5(t), 7^ A c [i?]), /:(r(f), ^ a c [i?])} < 

with ^2 := ce + AR6/Z, where 
Finally, much as before. 



d,v(/:((T(f), 7^ A C [i?])), X Po(A^(0)) 



< 



and we can take $k_R := 4 \vee (c_\theta + R k_2)$ and
$k'_R := R(\theta + k_1)$ in the theorem. □

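We note that the Poisson means $\lambda^A(\theta)$ appearing in (5.7)
may be calculated using an inclusion-exclusion argument. For reasons of
symmetry it is only necessary to compute $\lambda^A(\theta)$ for sets
of the form $A = [r] = \{1, 2, \ldots, r\}$ for
$r = 1, 2, \ldots, R - 1$. We obtain

\[ \mathrm{MN}(j; p_1, \ldots, p_R)\{\mathcal{M}_{j[r]}\}
   = \sum_{l=1}^{r} (-1)^{r-l} \sum_{J \subseteq [r],\, |J| = l}
     \Bigl(\sum_{s \in J} p_s\Bigr)^{j}, \tag{5.10} \]

from which the terms $\lambda^A(\theta)$ readily follow as

\[ \lambda^{[r]}(\theta)
   = \theta \sum_{l=1}^{r} (-1)^{r-l} \sum_{J \subseteq [r],\, |J| = l}
     \log\Bigl(\frac{1}{1 - \sum_{s \in J} p_s}\Bigr). \tag{5.11} \]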
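
The inclusion-exclusion calculation can be sketched in Python as follows, assuming the form $\lambda^A(\theta) = \theta \sum_{\emptyset \ne J \subseteq A} (-1)^{|A| - |J|} \log\{1/(1 - \sum_{s \in J} p_s)\}$ obtained by summing $(\theta/j)$ times the inclusion-exclusion expression over $j \ge 1$. The function name `poisson_mean` and the 0-based observer indices are illustrative choices.

```python
from itertools import combinations
from math import log


def poisson_mean(subset, p, theta):
    """lambda^A(theta) for a non-empty proper subset A of observers.

    subset holds 0-based observer indices; p holds the proportions p_r,
    which are assumed to sum to 1.
    """
    subset = tuple(subset)
    total = 0.0
    for size in range(1, len(subset) + 1):
        sign = (-1) ** (len(subset) - size)
        for group in combinations(subset, size):
            p_group = sum(p[r] for r in group)
            # sum over j >= 1 of x^j / j equals log(1 / (1 - x)) for x < 1
            total += sign * log(1.0 / (1.0 - p_group))
    return theta * total
```

For $R = 2$ and $A = \{1\}$ this reduces to $\theta \log\{1/(1 - p_1)\}$, the mean of $P_1$ in Theorem 5.1.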

5.3 The conditional distribution 

In statistical applications, such as that discussed above, the value of
$\theta$ is unknown, and has to be estimated. Defining

\[ K_{st}(c) := \sum_{j=s+1}^{t} c_j, \]

the quantity $K_{0n}(C)$ is sufficient for $\theta$, and the null
distribution appropriate for testing model fit is then the conditional
distribution

\[ \mathcal{L}\bigl((S^A,\ \emptyset \ne A \subsetneq [R]) \mid K_{0n}(C) = k\bigr), \]

where $k$ is the observed value of $K_{0n}(C)$. Hence we need to
approximate this distribution as well. Because of sufficiency, the
distribution no longer involves $\theta$. However, for our
approximation, we shall need to define means for the approximating
Poisson random variables $P^A \sim \mathrm{Po}(\lambda^A(\theta))$, as
in (5.7), and these need a value of $\theta$ for their definition. We
thus take $\theta = \theta_k$ for our approximation, for convenience
with $\theta_k := k/\log n$; the MLE given in (3.2) could equally well
have been used.

The proof again runs along the same lines. Supposing that the
probabilities $p_1, \ldots, p_R$ are bounded away from 0, we can take
$b := b_n := \lfloor \log n/\log(1/p) \rfloor$ in Theorem 5.2 and use
(5.8) to show that it is enough to approximate
$\mathcal{L}((S^A_{(b)},\ \emptyset \ne A \subsetneq [R]) \mid K_{0n}(C) = k)$.
Then, since the arguments conditional on the whole realization $C$
remain the same when restricting $C$ to the set $\{K_{0n}(C) = k\}$, it
is enough to show that the distributions

\[ \mathcal{L}(C[0,b] \mid K_{0n}(C) = k) \quad\text{and}\quad
   \mathcal{L}_{\theta_k}(Z[0,b]) \]

are close enough, where $c[0,b] := (c_1, \ldots, c_b)$, to conclude
that the Poisson approximation of Theorem 5.2 with
$\theta = \theta_k$ also holds conditionally on $\{K_{0n}(C) = k\}$.
Note also that the event $\{K_{0n}(C) = k\}$ has probability at least
as big as $c_1(\theta_k) k^{-1/2}$ for some positive function
$c_1(\cdot)$, by (8.17) of Arratia et al. [1].

Defining $\lambda_{st}(\theta_k) := \theta_k \sum_{j=s+1}^{t} j^{-1}$,
we now prove the key lemma.

Lemma 5.3 Fix any $\varepsilon, \eta > 0$. Suppose that $n$ is large
enough, so that $b + b^{7/2} < n/2$. Then there is a constant $\kappa$
such that, uniformly for $\varepsilon \le k/\log n \le 1/\varepsilon$,
and for $c \in \mathbb{Z}_+^{\infty}$ with
$K_{0b}(c) \le \eta \log\log n$ and $T_{0b}(c) \le b^{7/2}$,

\[ \left| \frac{P[C[0,b] = c[0,b] \mid K_{0n}(C) = k]}
               {P_{\theta_k}[Z[0,b] = c[0,b]]} - 1 \right|
   \le \frac{\kappa \log\log n}{\log n}. \]

Proof Since £(Ci, 
that 



, C„) = C{Zi, . . . , Zn\Ton{Z) = n), it follows 



P[C[„^b]=C[o,b]\KoniC) ^k] 

_ P[Kon{C) = k I C[o^b] ^ C[Q^f,]]P[C[Q^b] = C[a.b]] 

P[Kon{C) = k] 

^ P[Kbn{C) = fc - K„b{c) I rob(C) = Tob{c)]P[C[„,b] = C[o,b]] 
P[/io„(C) = k] 

We now use results from §13.10 of Arratia et al. Q- First, as on p. 323, 

Pe, [Kbn{C) = k- Kobic) I TobiC) = Tob{c)] 
= Pg, [Kbn{Z) = fc - Kob{c) I T,,„(Z) = n - ro&(c)], 

and the estimate on p. 327 then gives 

Pg, [Kbn{Z) = fc - Kob{c) I Tbn{Z) - n - To6(c)] 

= Po (A,,„(0fe)){fc - Kob{c) - 1} {1 + 0((logn)-^ log log n)}, (5.12) 

uniformly in the chosen ranges of fc, TQb{c) and Ji'oti(c), because of the 
choice 9 ~ 9k- Then 

Pg^KoniC) ^ k] = Po(Ao„(efc)){fc-l}{l + 0((logn)-i)}, (5.13) 

again uniformly in k, Tob(c) and Kob{c), by Theorem 5.4 of Arratia et al. 
[y. Finally, 



1 



PejTt„(Z) = n - Tobic)] 



Pe,[TbniZ) = n] 

by (4.43), (4.45) and Example 9.4 of [ll, if 6 + < n/2. The lemma now 
follows by considering the ratio of the Poisson probabilities in (|5.12p and 
(15331); note that Ao„(6'fe) - \bn{Ok) = C'(loglogri). □ 



In order to deduce the main theorem of this section, we just need to
bound the conditional probabilities of the events
$\{K_{0b}(C) > \eta \log\log n\}$ and $\{T_{0b}(C) > b^{7/2}\}$, given
$K_{0n}(C) = k$. For the first, note that

\[ P_{\theta_k}[K_{0b}(C) > \eta \log\log n]
   \le \frac{c_{\theta_k} b}{n}
     + P_{\theta_k}[K_{0b}(Z) > \eta \log\log n], \tag{5.14} \]

and that $K_{0b}(Z) \sim \mathrm{Po}(\theta_k \sum_{j=1}^{b} j^{-1})$,
with mean of order $O(\log\log n)$. Hence there is an $\eta$ large
enough that

\[ P_{\theta_k}[K_{0b}(C) > \eta \log\log n] = O((\log n)^{-5/2}), \]

uniformly in the given range of $k$. Since also, from (5.13),

\[ P_{\theta_k}[K_{0n}(C) = k] \ge \eta'/\sqrt{\log n} \]

for some $\eta' > 0$, it follows immediately that

\[ P_{\theta_k}[K_{0b}(C) > \eta \log\log n \mid K_{0n}(C) = k]
   = O((\log n)^{-2}). \tag{5.15} \]

The second inequality is similar. We use the argument of (5.14) to
reduce consideration to $P_{\theta_k}[T_{0b}(Z) > b^{7/2}]$, and (4.44)
of Arratia et al. [1] shows that

\[ P_{\theta_k}[T_{0b}(Z) > b^{7/2}] = O(b^{-5/2}) = O((\log n)^{-5/2}); \]

the conclusion is now as for (5.15).

In view of these considerations, we have established the following
theorem, justifying the Poisson approximation to the conditional
distribution of the $(S^A,\ \emptyset \ne A \subsetneq [R])$, using the
estimated value $\theta_k$ of $\theta$ as parameter.

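Theorem 5.4 For any $0 < \varepsilon < 1$, uniformly in
$\varepsilon \le k/\log n \le 1/\varepsilon$, we have

\[ d_{TV}\Bigl(\mathcal{L}\bigl((S^A,\ \emptyset \ne A \subsetneq [R])
        \mid K_{0n}(C) = k\bigr),\
      \bigtimes_{\emptyset \ne A \subsetneq [R]}
      \mathrm{Po}(\lambda^A(\theta_k))\Bigr)
   = O\Bigl(\frac{\log\log n}{\log n}\Bigr). \]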

Note that the error bound is much larger for this approximation than
those in the previous theorems. However, it is not unreasonable. From
(5.8), the joint distribution of the $S^A$ is almost entirely
determined by that of $C_1, \ldots, C_b$. Now
$\mathcal{L}(K_{0b}(C) \mid K_{0n}(C) = k)$ can be expected to be close
to $\mathcal{L}(K_{0b}(Z) \mid K_{0n}(Z) = k)$, which is binomial
$\mathrm{Bi}(k, p_{b,n})$, where

\[ p_{b,n} := \frac{\sum_{j=1}^{b} 1/j}{\sum_{j=1}^{n} 1/j}
   \sim \frac{\log b}{\log n}
   \asymp \frac{\log\log n}{\log(1/p) \log n}. \]

On the other hand, from Lemma 5.3 of [1], the unconditional
distribution of $K_{0b}(C)$ is very close to that of $K_{0b}(Z)$, a
Poisson distribution. The total variation distance between the
distributions $\mathrm{Po}(kp)$ and $\mathrm{Bi}(k, p)$ is of exact
order $p$ if $kp$ is large (Theorem 2 of Barbour and Hall [3]). Since
$p_{b,n} \asymp \log\log n/\log n$, an error of this order in Theorem
5.4 is thus in no way surprising.



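We can now compute the mean $\mu$ of the approximation to the
distribution of $Q$, as used in Section 4, obtained by using Theorem
5.4. We begin by noting that, using the theorem,

\[ K_r - K_s = \sum_{A:\, r \in A,\, s \notin A} S^A
             - \sum_{A:\, r \notin A,\, s \in A} S^A \]

is close in distribution to

\[ K_{rs} - K_{sr} := \sum_{A:\, r \in A,\, s \notin A} P^A
                    - \sum_{A:\, r \notin A,\, s \in A} P^A, \]

where $P^A \sim \mathrm{Po}(\lambda^A(\theta_k))$,
$\emptyset \ne A \subsetneq [R]$, are independent. To compute the means

\[ \lambda_{rs} := \sum_{A:\, r \in A,\, s \notin A} \lambda^A(\theta_k)
   \quad\text{and}\quad
   \lambda_{sr} := \sum_{A:\, r \notin A,\, s \in A} \lambda^A(\theta_k) \]

of $K_{rs}$ and $K_{sr}$, we note that

\[ \sum_{A:\, r \in A,\, s \notin A}
     \mathrm{MN}(j; p_1, \ldots, p_R)\{\mathcal{M}_{jA}\}
   = (1 - p_s)^j \{1 - (1 - p_r/(1 - p_s))^j\}
   = (1 - p_s)^j - (1 - p_r - p_s)^j, \]

the probability under the multinomial scheme that the $r$-th cell is
non-empty but the $s$-th cell is empty. Thus

\[ \lambda_{rs} = \sum_{j \ge 1} \frac{\theta_k}{j}
     \{(1 - p_s)^j - (1 - p_r - p_s)^j\}
   = \theta_k \log((p_r + p_s)/p_s), \]

and $\lambda_{sr} = \theta_k \log((p_r + p_s)/p_r)$. Then, because
$K_{rs}$ and $K_{sr}$ are independent and Poisson distributed,

\[ E\{(K_{rs} - K_{sr})^2\}
   = (\lambda_{rs} - \lambda_{sr})^2 + \lambda_{rs} + \lambda_{sr}. \]

This yields the formula

\[ \mu := \frac{1}{R(R - 1)} \sum_{1 \le r < s \le R}
     \Bigl[ \theta_k^2 \{\log(p_r/p_s)\}^2
          + \theta_k \log\Bigl(\frac{(p_r + p_s)^2}{p_r p_s}\Bigr) \Bigr]. \]

In particular, if $p_r = 1/R$ for $1 \le r \le R$, then
$\mu = \theta_k \log 2$, agreeing with the observation of Ewens et al.
in the case $R = 2$.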
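
The formula for $\mu$ is easy to evaluate numerically. The Python sketch below does so for arbitrary sampling proportions and confirms the equal-proportion special case $\mu = \theta_k \log 2$; the function name `q_mean` and the example inputs are illustrative assumptions.

```python
from itertools import combinations
from math import log


def q_mean(p, theta_k):
    """Approximate mean of Q for sampling proportions p and parameter theta_k."""
    r_obs = len(p)
    total = 0.0
    for p_r, p_s in combinations(p, 2):
        # (lambda_rs - lambda_sr)^2 + lambda_rs + lambda_sr for the pair (r, s)
        total += (theta_k * log(p_r / p_s)) ** 2
        total += theta_k * log((p_r + p_s) ** 2 / (p_r * p_s))
    return total / (r_obs * (r_obs - 1))


def theta_from_k(k, n):
    """The convenient estimate theta_k = k / log n used in Section 5.3."""
    return k / log(n)
```

For seven glands sampled equally, `q_mean([1 / 7] * 7, theta_from_k(k, n))` gives the approximate null mean of $Q$ for observed values of `k` and `n`.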



Our paper is about ancestral inference (albeit in a somatic cell setting 
rather than the typical population genetics one) and Poisson approxim- 
ation. John Kingman has made fundamental and far-reaching contribu- 
tions to both areas. It therefore gives us great pleasure to dedicate it to 
John on his birthday. 